{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import lightgbm as lgbm\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from MLFeatureSelection import sequence_selection, importance_selection, coherence_selection,tools"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this Demo notebook, three major selection methods are introducing including \n",
    "\n",
    "- ***sequece selection***\n",
    "\n",
    "- ***importance selection***\n",
    "\n",
    "- ***coherence selection***\n",
    "\n",
    "and tools like ***readlog***\n",
    "\n",
    "you can check the ***algorithms details*** [here](https://github.com/duxuhao/Feature-Selection/blob/master/Algorithms_Graphs)\n",
    "\n",
    "the detailed ***parameters and functions*** [here](https://github.com/duxuhao/Feature-Selection/blob/master/MLFeatureSelection)\n",
    "\n",
    "and more ***examples*** [here](https://github.com/duxuhao/Feature-Selection/tree/master/example)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use the dataset for evaluating the speech intelligibility in a classroom. Details of the dataset is availabel [here](https://github.com/duxuhao/Classroom-Acoustics-Research)\n",
    "\n",
    "#### 1. Read your dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def read():\n",
    "    df = pd.read_csv('Example/Speech_Intelligibility/CRS.csv')\n",
    "    return df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2. Define the required loss function based on your requirement, here we use the MAE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def lossfunction(y_pred, y_test):\n",
    "    \"\"\"define your own loss function with y_pred and y_test\n",
    "    return score\n",
    "    \"\"\"\n",
    "    return np.mean(np.abs(y_pred - y_test)/y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3. Define your validation method. It is quite flexible here because you can do whatever you want as long as it return\n",
    "\n",
    "- evaluation score\n",
    "- last classifier or estimator\n",
    "\n",
    "you can do ***k-fold***, ***70%-30%*** validations, etc. In this example, we just simplify the validation with all data training and testing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def validate(X, y, features, clf, lossfunction):\n",
    "    \"\"\"define your own validation function with 5 parameters\n",
    "    input as X, y, features, clf, lossfunction\n",
    "    clf is set by SetClassifier()\n",
    "    lossfunction is import earlier\n",
    "    features will be generate automatically\n",
    "    function return score and trained classfier\n",
    "    \"\"\"\n",
    "    clf.fit(X[features],y)\n",
    "    y_pred = clf.predict(X[features])\n",
    "    score = lossfunction(y,y_pred)\n",
    "    return score, clf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4. Define the selection methods you use, you can use single method or combination"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### 4.1. sequence selection\n",
    "\n",
    "- initialized selector with wanted process (sequence, random, cross features) [details](https://github.com/duxuhao/Feature-Selection/blob/master/Algorithms_Graphs/sequence_selection.png)\n",
    "\n",
    "- import dataframe and define the label name\n",
    "\n",
    "- import loosfunction and improve direction ('ascend' or 'descend') based on evaluation metrics (accuracy, logloss, etc)\n",
    "\n",
    "- if cross features process is selected, import dictionary of cross method\n",
    "\n",
    "- define features that are not trainable\n",
    "\n",
    "- define list initial features combination (start with [] will select from scratch, start with all features will do backward searching at the beginning)\n",
    "\n",
    "- generate candidate features list\n",
    "\n",
    "- can set time limit or features limit\n",
    "\n",
    "- define selected estimator\n",
    "\n",
    "- set the log file name\n",
    "\n",
    "- start running"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "define dictionary of cross methods\n",
    "\"\"\"\n",
    "\n",
    "def add(x,y):\n",
    "    return x + y\n",
    "\n",
    "def substract(x,y):\n",
    "    return x - y\n",
    "\n",
    "def times(x,y):\n",
    "    return x * y\n",
    "\n",
    "def divide(x,y):\n",
    "    return (x + 0.001)/(y + 0.001)\n",
    "\n",
    "def sq(x,y):\n",
    "    return x ** 2\n",
    "\n",
    "\n",
    "CrossMethod = {#'+':add,\n",
    "               #'-':substract,\n",
    "               '*':times,\n",
    "               #'/':divide,\n",
    "               #'^': sq,\n",
    "               }"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def seq(df,f, notusable,estimator):\n",
    "    sf = sequence_selection.Select(Sequence = True, Random = False, Cross = True) #initialized selector with wanted process\n",
    "    sf.ImportDF(df,label = 'SI') #import dataframe and define the label name\n",
    "    sf.ImportLossFunction(lossfunction, direction = 'descend') #import loosfunction and improve direction\n",
    "    sf.ImportCrossMethod(CrossMethod) #import dictionary of cross method\n",
    "    sf.InitialNonTrainableFeatures(notusable) #define features that are not trainable\n",
    "    sf.InitialFeatures(f) #define list initial features combination\n",
    "    sf.GenerateCol() #generate candidate features list\n",
    "    sf.clf = estimator #define selected estimator\n",
    "    sf.SetLogFile('record_seq.log') #set the log file name\n",
    "    return sf.run(validate) #start running"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### 4.2. importance selection\n",
    "\n",
    "- initialized selector [details](https://github.com/duxuhao/Feature-Selection/blob/master/Algorithms_Graphs/importance_selection.png)\n",
    "\n",
    "- import dataframe and define the label name\n",
    "\n",
    "- import loosfunction and improve direction ('ascend' or 'descend') based on evaluation metrics (accuracy, logloss, etc)\n",
    "\n",
    "- define remove features quantity each iteration\n",
    "\n",
    "- can set time limit\n",
    "\n",
    "- define selected estimator\n",
    "\n",
    "- set the log file name\n",
    "\n",
    "- start running"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def imp(df,f,estimator):\n",
    "    sf = importance_selection.Select() #initialized selector\n",
    "    sf.ImportDF(df,label = 'SI') #import dataset\n",
    "    sf.ImportLossFunction(lossfunction, direction = 'descend')  #import loosfunction and improve direction\n",
    "    sf.InitialFeatures(f)  #define list initial features combination\n",
    "    sf.SelectRemoveMode(batch = 1) #define remove features quantity each iteration\n",
    "    sf.clf = estimator #define selected estimator\n",
    "    sf.SetLogFile('record_imp.log') #set the log file name\n",
    "    return sf.run(validate) #start running"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### 4.3. coherence selection\n",
    "\n",
    "- initialized selector [details](https://github.com/duxuhao/Feature-Selection/blob/master/Algorithms_Graphs/coherence_selection.png)\n",
    "\n",
    "- import dataframe and define the label name\n",
    "\n",
    "- import loosfunction and improve direction ('ascend' or 'descend') based on evaluation metrics (accuracy, logloss, etc)\n",
    "\n",
    "- define remove features quantity each iteration and the removable criteria\n",
    "\n",
    "- can set time limit\n",
    "\n",
    "- define selected estimator\n",
    "\n",
    "- set the log file name\n",
    "\n",
    "- start running"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def coh(df,f,estimator):\n",
    "    sf = coherence_selection.Select() #initialized selector\n",
    "    sf.ImportDF(df,label = 'SI') #import dataset\n",
    "    sf.ImportLossFunction(lossfunction, direction = 'descend') #import loosfunction and improve direction \n",
    "    sf.InitialFeatures(f) #define list initial features combination\n",
    "    sf.SelectRemoveMode(batch=1, lowerbound = 0.5) #define remove features quantity each iteration and selection threshold\n",
    "    sf.clf = estimator #define selected estimator\n",
    "    sf.SetLogFile('record_coh.log') #set the log file name\n",
    "    return sf.run(validate) #start running"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 5. Run combination features selection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "def run():\n",
    "    df = read() # read dataset\n",
    "    notusable = ['SI'] #not trainable features  \n",
    "    f = tools.readlog('record2.log',0.086342) # use readlog to read the out log (filename, required score)\n",
    "    #f = ['SNR','BN'] #initial features combination\n",
    "    clf = lgbm.LGBMRegressor(num_leaves=35, max_depth=-1)\n",
    "    uf = f[:]\n",
    "    print('sequence selection')\n",
    "    uf = seq(df, uf, notusable,clf)\n",
    "    print('importance selection')\n",
    "    uf = imp(df,uf,clf)\n",
    "    print('coherence selection')\n",
    "    uf = coh(df,uf,clf)\n",
    "    return uf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sequence selection\n",
      "Features Quantity Limit: inf\n",
      "Time Limit: inf min(s)\n",
      "test performance of initial features combination\n",
      "Mean loss: 0.08634175498285059\n",
      "--------------------start greedy--------------------\n",
      "SNR\n",
      "SPL\n",
      "RT\n",
      "G\n",
      "******************** 5 round ********************\n",
      "BN\n",
      "0/1\n",
      "Mean loss: 0.08597677125607062\n",
      "SNR\n",
      "reverse 0/4\n",
      "Mean loss: 0.0869340780457153\n",
      "SPL\n",
      "reverse 1/4\n",
      "Mean loss: 0.08607867917223372\n",
      "RT\n",
      "reverse 2/4\n",
      "Mean loss: 0.09211142301516553\n",
      "G\n",
      "reverse 3/4\n",
      "Mean loss: 0.08948553448408178\n",
      "******************** 6 round ********************\n",
      "SNR\n",
      "reverse 0/4\n",
      "Mean loss: 0.0869340780457153\n",
      "SPL\n",
      "reverse 1/4\n",
      "Mean loss: 0.08607867917223372\n",
      "RT\n",
      "reverse 2/4\n",
      "Mean loss: 0.09211142301516553\n",
      "G\n",
      "reverse 3/4\n",
      "Mean loss: 0.08948553448408178\n",
      "--------------------complete greedy--------------------\n",
      "random select starts with:\n",
      " ['SNR', 'SPL', 'RT', 'G', 'BN']\n",
      " score: 0.08597677125607062\n",
      "small cycle cross\n",
      "0/25\n",
      "Mean loss: 0.08591000858005234\n",
      "1/25\n",
      "Mean loss: 0.08595127270999792\n",
      "2/25\n",
      "Mean loss: 0.08590996269800664\n",
      "3/25\n",
      "Mean loss: 0.08582540784416814\n",
      "4/25\n",
      "Mean loss: 0.08582540784416814\n",
      "5/25\n",
      "Mean loss: 0.08583220487013331\n",
      "6/25\n",
      "Mean loss: 0.08569959061258527\n",
      "7/25\n",
      "Mean loss: 0.0854100031801726\n",
      "8/25\n",
      "Mean loss: 0.0854100031801726\n",
      "9/25\n",
      "Mean loss: 0.0854100031801726\n",
      "10/25\n",
      "Mean loss: 0.08526573694974951\n",
      "11/25\n",
      "Mean loss: 0.08527422653843736\n",
      "12/25\n",
      "Mean loss: 0.08526573694974951\n",
      "13/25\n",
      "Mean loss: 0.08526573694974951\n",
      "14/25\n",
      "Mean loss: 0.08526573694974951\n",
      "15/25\n",
      "Mean loss: 0.08526555398853142\n",
      "16/25\n",
      "Mean loss: 0.08526555398853142\n",
      "17/25\n",
      "Mean loss: 0.08529623671779707\n",
      "18/25\n",
      "Mean loss: 0.08526555398853142\n",
      "19/25\n",
      "Mean loss: 0.0852839568176554\n",
      "20/25\n",
      "Mean loss: 0.08526555398853142\n",
      "21/25\n",
      "Mean loss: 0.08526555398853142\n",
      "22/25\n",
      "Mean loss: 0.08526555398853142\n",
      "23/25\n",
      "Mean loss: 0.08529778889677891\n",
      "24/25\n",
      "Mean loss: 0.08526555398853142\n",
      "test performance of initial features combination\n",
      "--------------------start greedy--------------------\n",
      "SNR\n",
      "SPL\n",
      "RT\n",
      "G\n",
      "BN\n",
      "(SNR*BN)\n",
      "(SNR*RT)\n",
      "(SNR*SPL)\n",
      "(SPL*G)\n",
      "(SPL*RT)\n",
      "(RT*BN)\n",
      "(G*BN)\n",
      "******************** 13 round ********************\n",
      "SNR\n",
      "reverse 0/11\n",
      "Mean loss: 0.08527453753734233\n",
      "SPL\n",
      "reverse 1/11\n",
      "Mean loss: 0.08525077379741682\n",
      "RT\n",
      "reverse 2/10\n",
      "Mean loss: 0.08529438015755389\n",
      "G\n",
      "reverse 3/10\n",
      "Mean loss: 0.08525077379741682\n",
      "BN\n",
      "reverse 4/10\n",
      "Mean loss: 0.08532384094182703\n",
      "(SNR*BN)\n",
      "reverse 5/10\n",
      "Mean loss: 0.08528437731437818\n",
      "(SNR*RT)\n",
      "reverse 6/10\n",
      "Mean loss: 0.08526821270588641\n",
      "(SNR*SPL)\n",
      "reverse 7/10\n",
      "Mean loss: 0.08527726314029775\n",
      "(SPL*G)\n",
      "reverse 8/10\n",
      "Mean loss: 0.0852621090353776\n",
      "(SPL*RT)\n",
      "reverse 9/10\n",
      "Mean loss: 0.08526011903878643\n",
      "(RT*BN)\n",
      "reverse 10/10\n",
      "Mean loss: 0.08548033355801461\n",
      "******************** 12 round ********************\n",
      "SPL\n",
      "0/1\n",
      "Mean loss: 0.08526555398853142\n",
      "SNR\n",
      "reverse 0/10\n",
      "Mean loss: 0.0852652903426519\n",
      "RT\n",
      "reverse 1/10\n",
      "Mean loss: 0.08529438015755389\n",
      "G\n",
      "reverse 2/10\n",
      "Mean loss: 0.08525077379741682\n",
      "BN\n",
      "reverse 3/10\n",
      "Mean loss: 0.08532384094182703\n",
      "(SNR*BN)\n",
      "reverse 4/10\n",
      "Mean loss: 0.08528437731437818\n",
      "(SNR*RT)\n",
      "reverse 5/10\n",
      "Mean loss: 0.08526821270588641\n",
      "(SNR*SPL)\n",
      "reverse 6/10\n",
      "Mean loss: 0.08527726314029775\n",
      "(SPL*G)\n",
      "reverse 7/10\n",
      "Mean loss: 0.0852621090353776\n",
      "(SPL*RT)\n",
      "reverse 8/10\n",
      "Mean loss: 0.08526011903878643\n",
      "(RT*BN)\n",
      "reverse 9/10\n",
      "Mean loss: 0.08548033355801461\n",
      "--------------------complete greedy--------------------\n",
      "random select starts with:\n",
      " ['SNR', 'RT', 'G', 'BN', '(SNR*BN)', '(SNR*RT)', '(SNR*SPL)', '(SPL*G)', '(SPL*RT)', '(RT*BN)', '(G*BN)']\n",
      " score: 0.08525077379741682\n",
      "*-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-*\n",
      "best score:0.08525077379741682\n",
      "best features combination: ['SNR', 'RT', 'G', 'BN', '(SNR*BN)', '(SNR*RT)', '(SNR*SPL)', '(SPL*G)', '(SPL*RT)', '(RT*BN)', '(G*BN)']\n",
      "importance selection\n",
      "Features Quantity Limit: inf\n",
      "Time Limit: inf min(s)\n",
      "test performance of initial features combination\n",
      "remove features: baseline\n",
      "Mean loss: 0.08525077379741682\n",
      "Remove Batch: 1\n",
      "remove features: ['(SNR*SPL)']\n",
      "Mean loss: 0.08527726314029775\n",
      "remove features: ['(SPL*G)']\n",
      "Mean loss: 0.08528521963622046\n",
      "remove features: ['SNR']\n",
      "Mean loss: 0.08527581758361784\n",
      "remove features: ['G']\n",
      "Mean loss: 0.08527581758361784\n",
      "remove features: ['(SNR*RT)']\n",
      "Mean loss: 0.085247955716231\n",
      "remove features: ['RT']\n",
      "Mean loss: 0.08535047616265658\n",
      "remove features: ['(G*BN)']\n",
      "Mean loss: 0.0894309351211068\n",
      "remove features: ['(SNR*BN)']\n",
      "Mean loss: 0.08962612978325311\n",
      "remove features: ['BN']\n",
      "Mean loss: 0.0897996883109652\n",
      "remove features: ['(SPL*RT)']\n",
      "Mean loss: 0.10308506394665873\n",
      "*-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-*\n",
      "best score:0.085247955716231\n",
      "best features combination: ['RT', 'BN', '(SNR*BN)', '(SPL*RT)', '(RT*BN)', '(G*BN)']\n",
      "coherence selection\n",
      "Features Quantity Limit: inf\n",
      "Time Limit: inf min(s)\n",
      "test performance of initial features combination\n",
      "remove features: baseline\n",
      "Mean loss: 0.085247955716231\n",
      "Remove Batch: 1\n",
      "totally 4 features above 0.5\n",
      "Delete (RT*BN) with coherence 0.9503245519492791\n",
      "remove features: ['(RT*BN)']\n",
      "Mean loss: 0.08546699279740141\n",
      "Delete (SPL*RT) with coherence 0.8053825819089316\n",
      "remove features: ['(SPL*RT)']\n",
      "Mean loss: 0.08591391363465142\n",
      "*-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-**-*\n",
      "best score:0.085247955716231\n",
      "best features combination: ['RT', 'BN', '(SNR*BN)', '(SPL*RT)', '(RT*BN)', '(G*BN)']\n"
     ]
    }
   ],
   "source": [
    "bf = run()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['RT', 'BN', '(SNR*BN)', '(SPL*RT)', '(RT*BN)', '(G*BN)']\n"
     ]
    }
   ],
   "source": [
    "print(bf)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
